Welcome to the Festival of Genomics workshop on single-cell analyses

This is an R Markdown document that contains instructions and code for the examples used in todays workshop. The first few steps will check that you have all the packages and data files necessary to carry out all of the analyses.

Hour 1: Getting Started

Sara’s section …

Check that the Brain Atlas data files are present

The following code chunk assumes that Brain Atlas files have been downloaded and placed in the “BrainAtlas” subdirectory of a folder in your home directory entitled “FestivalWorkshopSC”. If you have downloaded these files to another location, either create a new folder and move the files there, or substitute the file path to where they are currently located for “~/FestivalWorkshopSC/BrainAtlas”. If you are using the RStudio Server instance provided by the workshop, change the first line that follows to setwd("/home/FestivalWorkshopSC/BrainAtlas")

setwd("~/FestivalWorkshopSC/BrainAtlas")
file.exists("cell_metadata.csv")
## [1] TRUE
file.exists("genes_counts.csv")
## [1] TRUE
file.exists("README.txt")
## [1] TRUE

If any of the preceding lines return FALSE, double check that you have set the correct working directory and that all download files have been placed in that folder. If they are missing, you can download the files here. Note that there will be more files than listed here (including alternate gene count quantifications, and metadata about data-driven cluster memberships as discussed in the paper “Adult mouse cortical cell taxonomy revealed by single cell transcriptomics”), but these are the ones we will make use of.

Read the Brain Atlas data files into R

The read.csv function in R is useful for reading in .csv (comma-separated value) files. First, we’ll read in the main data file genes_counts.csv to a data frame using this function and check its contents. The extra arguments to this function help to format our object so that we have row and column names, and we use character variables instead of converting to factors. For more details on these arguments, you can type help(read.csv). In the resulting coutns object, we have genes in rows (24057) and cells in columns (1679).

counts <- read.csv("genes_counts.csv", stringsAsFactors = FALSE, header=TRUE, row.names = 1)
str(counts[,1:20]) # restrict to the first 20 columns (cells) 
## 'data.frame':    24057 obs. of  20 variables:
##  $ A01101401: num  0 992 2.57 0 0 0 0 0 0 1 ...
##  $ A01101402: num  0 2287 177 0 0 ...
##  $ A01101403: num  0 492 0 0 0 ...
##  $ A01101404: num  0 1932 1 0 0 ...
##  $ A01101405: num  0 1425 2 0 0 ...
##  $ A01101406: num  0 130 3 0 0 ...
##  $ A01101407: num  0 2110 3041 0 17 ...
##  $ A01101408: num  0 955 101 0 0 ...
##  $ A02271433: num  0 326 0 0 0 349 0 0 0 0 ...
##  $ A02271434: num  0 933 1042 0 0 ...
##  $ A02271435: num  0 296 0 0 0 ...
##  $ A02271436: num  0 31 200 0 0 984 0 0 0 36 ...
##  $ A02271437: num  0 970 0 0 3 ...
##  $ A02271438: num  0 594 355 0 228 ...
##  $ A02271439: num  0 2290 3294 0 0 ...
##  $ A02271440: num  0 29 1 0 0 ...
##  $ A12101401: num  0 373 540 0 0 ...
##  $ A12101402: num  0 462 197 0 0 0 0 0 0 0 ...
##  $ A12101403: num  0 516 96 0 498 ...
##  $ A12101404: num  0 1019 0 0 0 ...

The cell_metadata.csv file contains 1679 rows (one for each cell) and columns containing information such as collection date, sequencing type, total reads, mapping percentage, dissection layer, and major/minor derived cell subtypes.

cells <- read.csv("cell_metadata.csv", stringsAsFactors = FALSE, header = TRUE)
str(cells)
## 'data.frame':    1679 obs. of  16 variables:
##  $ long_name         : chr  "A01101401" "A01101402" "A01101403" "A01101404" ...
##  $ cre               : chr  "Calb2" "Calb2" "Calb2" "Calb2" ...
##  $ collection_date   : chr  "11/18/2013" "11/18/2013" "11/18/2013" "11/18/2013" ...
##  $ sequencing_type   : chr  "hiseq" "hiseq" "hiseq" "hiseq" ...
##  $ total_reads       : int  23770190 9694719 5864322 22102121 24057147 24171169 22447919 20995719 9023705 23016406 ...
##  $ all_mapped_percent: num  93.5 92.9 90.5 93.2 93.1 ...
##  $ mRNA_percent      : num  54.4 45.7 48.3 51.4 51.1 ...
##  $ genome_percent    : num  30.1 35.5 34 33.8 32.1 ...
##  $ ercc_percent      : num  4.36 7.84 4.12 4.24 4.98 3.14 3.12 2.27 3.22 2.13 ...
##  $ tdt_permillion    : num  306 341 106 371 264 ...
##  $ major_class       : chr  "Inhibitory" "Inhibitory" "Excitatory" "Inhibitory" ...
##  $ sub_class         : chr  "Vip" "Vip" "L4" "Vip" ...
##  $ major_dissection  : chr  "V1" "V1" "V1" "V1" ...
##  $ layer_dissectoin  : chr  "All" "All" "All" "All" ...
##  $ color_code        : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ short_name        : chr  "A200_V" "A201_V" "A202_V" "A203_V" ...

More detailed information about the files downloaded from the Allen Brain Atlas can be found in the README.txt file provided. Here is a peek at the contents of that file.

# This is a bash command, to be executed at the command line (not within R);
# Alternatively, simply open the README.txt in your favorite text editor to view its contents
head README.txt
## Description of files contained in this data download:
## 1.   genes_counts.csv: File containing read count values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 2.   genes_rpkm.csv: File containing the RPKM values obtained from the RSEM algorithm for each gene (row) for each cell (column)
## 3.   ercc_counts.csv: File containing read count values obtained from the RSEM algorithm for each external ERCC spike-in control (row) for each cell (column)
## 4.  cell_metadata.csv: File containing information about each cell profiled, including its nomenclature, Cre line of origin, dissection, date of collection and sequencing, and read mapping statistics
## 5.   cluster_metadata.csv: File containing information about each data-driven cluster, including its label, the corresponding label in Tasic et al. (Nat. Neuro, 2106), the primary cell class membership, and marker genes (including genes with widespread expression in the cluster, sparse expression in the cluster, and no expression in the cluster). 
## 6.   cell_classification.csv: File containing information about the cluster membership of each cell, including whether the cell is a "core" (unambiguously assigned to a single cluster) or "transition" (shares membership between two clusters) cell, as well as its membership score (from 0-10) for each cluster (labeled f01 to f49). 
## 
## To generate the count and RPKM data, 100bp single-end reads were aligned using RSEM to the mm10 mouse genome build with the RefSeq annotation downloaded on 11 June 2013.
## Raw fastq files are available at Gene Expression Omnibus, accession ID GSE71585

Check that the desired R packages have been installed

Once a list of desired R packages is finalized, can check that they are installed with

require(scde)     #bioconductor
require(monocle)  #bioconductor
require(sincell)  #bioconductor
require(scDD)     #github
require(ggplot2)  #cran
require(devtools) #cran

If any of these commands return a message that includes “there is no package called…”, then the package is missing and needs to be installed. Note that packages may be stored in one of several package repositories. The most popular are Bioconductor, github, and CRAN. For Bioconductor packages, for example edgeR, this can be done with the following code:

source("http://bioconductor.org/biocLite.R")
biocLite("monocle")

For CRAN packages, for example devtools, installation can be done with the following code:

install.packages(devtools)

For Github packages, for example scDD, installation can be done with the following code:

install.packages("devtools")
devtools::install_github("kdkorthauer/scDD")

Visualize major axes of variation in a PCA plot

# extract top 500 variable genes
gene.var <- apply(counts, 1, function(x) var(log(x[x>0])))
counts.top500 <- counts[which(rank(-gene.var)<=500),]

counts.pca <- prcomp(log(counts.top500+1),
                   center = TRUE,
                   scale. = TRUE) 
summary(counts.pca)$importance[,1:5]
##                             PC1      PC2      PC3      PC4      PC5
## Standard deviation     25.44755 15.98222 8.598395 7.371017 7.197588
## Proportion of Variance  0.38569  0.15213 0.044030 0.032360 0.030850
## Cumulative Proportion   0.38569  0.53783 0.581860 0.614220 0.645070
plot(counts.pca, type="l", main="Top 10 PCs")

color_class <- rainbow(length(unique(cells$major_class)))
plot(counts.pca$rotation[,1], counts.pca$rotation[,2], 
      xlab="PC 1", ylab="PC 2", col=color_class[as.numeric(factor(cells$major_class))], pch=20,
      main="PCA plot of cells colored by derived major class")

color_class <- rainbow(length(unique(cells$layer_dissectoin)))
plot(counts.pca$rotation[,1], counts.pca$rotation[,2], 
      xlab="PC 1", ylab="PC 2", col=color_class[as.numeric(factor(cells$layer_dissectoin))], pch=20,
      main="PCA plot of cells colored by Dissection Layer")

Hour 2: Normalization and Quality Control of scRNA-seq

Apua’s Section …

Normalization

Quality Control (QC measures)

detectionRate <- apply(counts, 2, function(x) sum(x > 0) / length(x))
hist(detectionRate)

Hour 3: Analysis Modules

Keegan’s Section …

Identify the highly variable genes

library(sincell)

Identify differentially expressed genes

library(scde)
library(scDD)

Order cells by “Psuedotime” (temporal-spatial variation)

library(monocle)

Identify genes with oscillating expression patterns

library(Oscope)
## Loading required package: EBSeq
## Loading required package: blockmodeling
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
## Loading required package: testthat
## 
## Attaching package: 'testthat'
## The following object is masked from 'package:igraph':
## 
##     compare
## Loading required package: cluster
## Loading required package: BiocParallel
## Warning: package 'BiocParallel' was built under R version 3.2.3

TODO: Fill in the above analysis modules …